feat: parallelize container monitor dependency requests #6757
Conversation
✅ Snyk checks have passed. No issues have been found so far.
Benchmark — `quay.io/wildfly/wildfly:34.0.1.Final-jdk21` (512 ScanResults)

| Configuration | Wall-clock | vs sequential |
|---|---|---|
| sequential (snyk v1.1303.2, prior code) | 481 s (8m 1s) | 1.00× |
| `SNYK_REQUEST_CONCURRENCY=5` (PR B) | 104 s | 4.6× faster |
| default c=10 (PR B) | 54 s | 8.9× faster |
| `SNYK_REQUEST_CONCURRENCY=20` (PR B) | 29 s | 16.6× faster |
Each PR-B run successfully updated all 512 snapshots. The sequential run produced 510 snapshots, possibly due to minor version drift between v1.1303.2 and the PR binary's bundled `snyk-docker-plugin`. No errors were logged, and the discrepancy doesn't affect the timing comparison.
```ts
const MAX_REQUEST_CONCURRENCY = 50;

/**
 * Returns the maximum number of in-flight Snyk dependency-test or
```
Do you know if these APIs rate limit? Could we run into this if a customer (like Walmart) sets this to 50?
Good question. I am worried about that. I'm not sure how the rate limiting occurs (e.g., dest endpoint, source IP, etc.). I need to look into this. I'm hoping going from 5->10 as a default is relatively safe, but I don't think we can do more without some testing.
Rate-limit investigation summary
I dug into the api-gateway config (github.com/snyk/api-gateway) to confirm the bumped concurrency won't push CLI users over rate limits. Here's what I found.
Rate-limit policy for our endpoints
`POST /v1/test-dependencies`, `POST /v1/test-dep-graph`, and `PUT /v1/monitor-dependencies` all match the default `/v1/test*` and `/v1/monitor*` prefix routes in `helm/templates/route-tables/api-v1.yaml`. None of them have a custom `envoyMetadata.snyk.ratelimiter.keys.bucket` override, so they fall into the default `high` bucket of the api-v1 rate-limit config.
- Keyed by `principal_id` — per-authenticated-user (per-token), per `docs/rate-limiting.md`. Not per-IP, not per-cluster.
- Default bucket: `high`. Limits per token:
| Window | Limit |
|---|---|
| Per second | 200 req/s |
| Per minute | 2,000 req/min (≈ 33/s sustained) |
| Per hour | 60,000 req/hour (≈ 16.7/s sustained) |
All three apply simultaneously (burst, 1-min, 1-hour ceilings).
The only adjacent exception is `lowerDepGraphRateLimit.enabled` (polaris-prod-pc-pc01-1 only), which drops `/v1/monitor/dep-graph` (with slash, not our endpoint) to `extra_low`. It doesn't affect us.
Headroom analysis for this PR
The default change is 5 → 10 in-flight requests per CLI invocation. The proposed env override (`SNYK_REQUEST_CONCURRENCY`) caps at 50.
| Scenario | Observed peak rate | Limit | Headroom |
|---|---|---|---|
| Wildfly container test, c=10 | ~13 req/s sustained | 200 req/s burst | 15× |
| Wildfly container test, c=20 | ~20 req/s sustained | 200 req/s burst | 10× |
| Wildfly container test, c=50 (max override) | ~50 req/s burst | 200 req/s burst | 4× |
For a single invocation, no setting in the supported range gets close to the per-second ceiling. The per-minute and per-hour buckets track total request count rather than concurrency: a 512-ScanResult run issues ~512 requests in total, roughly a quarter of the 2,000/min allowance even if it completes within a single minute.
Other context from registry-side telemetry
For completeness, I checked Datadog for actual 429s on the registry pod:
| Endpoint | 30-day requests | 30-day 429s | 429 rate |
|---|---|---|---|
| `POST /test-dep-graph` | 9.0 M | 24 | 0.00027% |
| `POST /test-dependencies` | 1.65 M | 22 | 0.0013% |
| `PUT /monitor-dependencies` | 1.04 M | 0 | 0% |
Caveat: the Envoy gateway returns 429s to clients before the request reaches the registry pod, so registry-side 429 counts under-represent total rejections. At the cluster level, ~1% of total traffic to `static-default-registry-web-30100_api-gateway` (~807M requests over 30 days) is rejected — but that's all paths combined, and per-endpoint attribution isn't available at the metric layer.
Verdict
The bump from 5 → 10 (and the 50 cap on the env override) is well within the per-token rate limit policy for these endpoints. Even at the highest supported override, a single CLI burst stays at ≤25% of the per-second ceiling and doesn't change per-minute/per-hour quota consumption.
Recommended follow-up (optional, separate PR)
The current retry logic in `run-test.ts:265-296` retries on 500/502/503 but not 429. If a heavy user (a CI fleet sharing a token) ever hits the per-hour bucket — which is plausible regardless of this PR — they get a hard failure today.
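A possible shape for that follow-up (a sketch only, not the existing `run-test.ts` code; `requestWithRetry` and the constants are illustrative stand-ins) would treat 429 as retryable and honor `Retry-After`:

```ts
// Hypothetical sketch, not the current run-test.ts implementation.
// fetch() stands in for the real request helper; the constant names echo
// RETRY_ATTEMPTS / RETRY_DELAY but their values here are assumptions.
const RETRY_ATTEMPTS = 3;
const RETRY_DELAY_MS = 500;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function requestWithRetry(url: string, init?: RequestInit): Promise<Response> {
  let lastError: unknown = new Error('request failed after retries');
  for (let attempt = 0; attempt < RETRY_ATTEMPTS; attempt++) {
    try {
      const res = await fetch(url, init);
      if (res.status === 429) {
        // Prefer the server's Retry-After hint (in seconds); otherwise back
        // off exponentially so a shared-token CI fleet drains the bucket
        // instead of hammering it.
        const retryAfterSec = Number(res.headers.get('retry-after'));
        const delayMs =
          Number.isFinite(retryAfterSec) && retryAfterSec > 0
            ? retryAfterSec * 1000
            : RETRY_DELAY_MS * 2 ** attempt;
        await sleep(delayMs);
        continue;
      }
      if ([500, 502, 503].includes(res.status)) {
        await sleep(RETRY_DELAY_MS); // existing behavior: retry transient 5xx
        continue;
      }
      return res; // success, or a non-retryable status such as other 4xx
    } catch (err) {
      lastError = err; // network-level failure: retry after a flat delay
      await sleep(RETRY_DELAY_MS);
    }
  }
  throw lastError;
}
```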
```ts
import {
  RETRY_ATTEMPTS,
  RETRY_DELAY,
  getRequestConcurrency,
```
Do we know why concurrency was not enabled for monitor-deps?
I assume it was just an oversight; I don't know for sure.
bdemeo12 left a comment

LGTM! Just a few questions!
Rewrite `monitorDependencies` in `src/lib/ecosystems/monitor.ts` to fan out per-ScanResult `/monitor-dependencies` PUTs in parallel via `pMap`, bounded by the `SNYK_REQUEST_CONCURRENCY` limit (default 5). Per-ScanResult work is extracted into `monitorOneScanResult` for testability and clarity.

Container images that produce many ScanResults (e.g. one per directory of JARs in fat-JAR-heavy images) previously incurred one full RTT per scan result, since the prior implementation used a nested for-loop with `await`. With bounded parallelism this collapses to ~ceil(N / concurrency) sequential batches, materially reducing wall-clock time for large images.

Error semantics are preserved:

- 401 still throws `AuthFailedError` (terminates the run).
- Other 4xx still throws `MonitorError` (terminates via `pMap` fail-fast).
- 5xx and other non-4xx errors are accumulated per-ScanResult into the `errors` array, matching the prior continue-on-error behavior.

Result order is preserved by `pMap` based on input order, so output remains deterministic regardless of completion order.

Tests cover the concurrency cap (default 5), the env override via the internal `SNYK_INTERNAL_REQUEST_CONCURRENCY` contract that the wrapping Go CLI forwards, ordering preservation, 4xx fail-fast, and 5xx accumulation.
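As a reading aid, here is a minimal sketch of the shape the commit message describes — not the actual diff; the types and the declared helpers (`ScanResult`, `MonitorResult`, `monitorOneScanResult`, `getRequestConcurrency`) are placeholders standing in for the real ones:

```ts
import pMap from 'p-map';

// Placeholder types/helpers standing in for the real ones in the diff.
type ScanResult = { identity: { type: string } };
type MonitorResult = { ok: boolean; projectName?: string };

class AuthFailedError extends Error {}
class MonitorError extends Error {
  constructor(public statusCode: number) {
    super(`monitor-dependencies failed with status ${statusCode}`);
  }
}

declare function monitorOneScanResult(scanResult: ScanResult): Promise<MonitorResult>;
declare function getRequestConcurrency(): number; // clamped; default 5

async function monitorDependencies(
  scanResults: ScanResult[],
): Promise<Array<MonitorResult | Error>> {
  // At most `concurrency` PUTs in flight; results come back in input order,
  // so the output is deterministic regardless of completion order.
  return pMap(
    scanResults,
    async (scanResult) => {
      try {
        return await monitorOneScanResult(scanResult);
      } catch (err) {
        const code = (err as { code?: number }).code ?? 0;
        // 401 and other 4xx rethrow: pMap's default fail-fast rejects the
        // whole run, matching the prior terminate-on-4xx behavior.
        if (code === 401) throw new AuthFailedError('authentication failed');
        if (code >= 400 && code < 500) throw new MonitorError(code);
        // 5xx and transport errors: record in this slot and keep going,
        // matching the prior continue-on-error behavior.
        return err as Error;
      }
    },
    { concurrency: getRequestConcurrency() },
  );
}
```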
PR Reviewer Guide 🔍
What does this PR do?
Rewrites `monitorDependencies` in `src/lib/ecosystems/monitor.ts` to fan out per-ScanResult `PUT /monitor-dependencies` calls in parallel via `pMap`, bounded by the `getRequestConcurrency()` helper introduced in #6756 (default 5, user-overridable via `SNYK_REQUEST_CONCURRENCY`).

Effective change for users: container monitor goes from sequential (1 in-flight request) to 5 in-flight by default, with the same env-var override surface as `snyk container test`.

- Per-ScanResult work is extracted into a `monitorOneScanResult` helper for testability.
- … (a single `monitor` invocation is the common case anyway).
- The core loop becomes `pMap(scanResults, ..., { concurrency })`.

This PR is based on `feat/snyk-request-concurrency` (#6756), so its diff shows only the monitor-flow changes.

Why?
`snyk container monitor` previously ran fully sequentially — one full RTT per ScanResult — because of a nested `for...of` `await` loop. For large images with many ScanResults (e.g. fat-JAR-heavy images that produce one ScanResult per directory of JARs in `snyk-docker-plugin`'s `groupJarFingerprintsByPath`), this is the dominant wall-clock cost.

Benchmark — `quay.io/wildfly/wildfly:34.0.1.Final-jdk21` (512 ScanResults)

Sequential baseline measured against the released CLI (`snyk` v1.1303.2, prior code path). PR configurations measured against the locally-built PR binary, varying only `SNYK_REQUEST_CONCURRENCY`. Re-monitoring updates existing project snapshots rather than creating duplicates.

| Configuration | Wall-clock | vs sequential |
|---|---|---|
| sequential (`snyk` v1.1303.2, prior code) | 481 s (8m 1s) | 1.00× |
| default (c=5, this PR) | 104 s | 4.6× faster |
| `SNYK_REQUEST_CONCURRENCY=10` (c=10) | 54 s | 8.9× faster |
| `SNYK_REQUEST_CONCURRENCY=20` (c=20) | 29 s | 16.6× faster |

Each PR run successfully updated all 512 snapshots. The sequential run produced 510 snapshots, possibly due to minor version drift between v1.1303.2 and the PR binary's bundled `snyk-docker-plugin`. No errors were logged, and the discrepancy doesn't affect the timing comparison.

The default-of-5 row is the new behavior users will see without any opt-in.
Error semantics preserved
- 401 still throws `AuthFailedError` (terminates the run).
- Other 4xx still throws `MonitorError` (terminates the run via `pMap`'s default fail-fast on rejection).
- 5xx and other non-4xx errors are accumulated per-ScanResult into the `errors` array, matching prior continue-on-error behavior.
pMapreturns results in input order regardless of completion order, so the formatted output remains deterministic.Where should the reviewer start?
src/lib/ecosystems/monitor.ts—monitorDependenciesrewrite + newmonitorOneScanResult.test/jest/unit/ecosystems-monitor-docker.spec.ts— 5 new tests in aparallelization of monitor-dependencies requestsdescribe:SNYK_INTERNAL_REQUEST_CONCURRENCY=3override respected (the internal env var the TS code reads; the user-facingSNYK_REQUEST_CONCURRENCYis plumbed through Go in feat: introduce SNYK_REQUEST_CONCURRENCY for dependency request parallelism #6756).MonitorError).How should this be manually tested?
npx jest --selectProjects coreCli --testPathPattern 'test/jest/unit/ecosystems-monitor-docker'SNYK_REQUEST_CONCURRENCYis plumbed through:Risk assessment
Low. The change is scoped to a single function. Error semantics are preserved (verified by tests). Concurrency is bounded and configurable.
pMapis already used byrunTestso it's a familiar dependency in this codebase.Background
Depends on / based on #6756, which:
- adds the `getRequestConcurrency` helper and clamps, and
- plumbs the `SNYK_REQUEST_CONCURRENCY` env var through the Go-side GAF configuration so the legacy CLI receives the resolved value via the internal `SNYK_INTERNAL_REQUEST_CONCURRENCY` env var.

Supersedes #6748, which:
- used unbounded `Promise.all`, which on a 500-ScanResult image would fire 500+ concurrent PUTs — a real risk of 429s, socket exhaustion, or thundering-herd backend load;
- used a "… `Promise.all` the rest" hybrid pattern; once results are kept in input order via `pMap`, the special case isn't load-bearing.
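To make that contrast concrete (a sketch; `monitorOne` stands in for the per-ScanResult worker):

```ts
import pMap from 'p-map';

declare const scanResults: unknown[];
declare function monitorOne(scanResult: unknown): Promise<unknown>;

// #6748's unbounded pattern: every PUT starts immediately, so a
// 500-ScanResult image opens 500+ concurrent requests at once.
await Promise.all(scanResults.map(monitorOne));

// This PR's bounded pattern: at most 5 requests in flight at any moment,
// with results still returned in input order.
await pMap(scanResults, monitorOne, { concurrency: 5 });
```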